This Page Is Inserted by IFW Operations 
and is not a part of the Official Record 



BEST AVAILABLE IMAGES 

Defective images within this document are accurate representations of the 
original documents submitted by the applicant. 

Defects in the images may include (but are not limited to): 

• BLACK BORDERS ( 

• TEXT CUT OFF AT TOP, BOTTOM OR SIDES 

• FADED TEXT 

• ILLEGIBLE TEXT 

• SKEWED/SLANTED IMAGES 

• COLORED PHOTOS 

• BLACK OR VERY BLACK AND WHITE DARK PHOTOS 

• GRAY SCALE DOCUMENTS 

IMAGES ARE BEST AVAILABLE COPY. 



As rescanning documents will not correct images, 
please do not report the images to the 
Image Problems Mailbox. 



Al IS IE\T-lllSlA) ISI-'ORMATIOS ri-truaal 



A Unified Approach to 
Automatic Indexing and 
Information Retrieval 



Altai Ginsberg, AT&T Befl Laboratories 



ICCORDING TO CONVENTIONAL 
wisdom, wc are living at the dawn of an 
information age, in which the united power 
of computers and high-bandwidth com- 
munications will give us immediate ac- 
cess to the entire corpus of human knowl- 
edge. At the same time, however, people 
are deluged daily by hundreds of elec- 
tronic mail messages and thousands of 
articles on bulletin boards. Thus, we arc 
also witnessing the beginning of informa- 
tion overload. 

To reap the benefits of the information 
age without being jaded by overload, we 
need a coherent strategy for organizing 
vast quantities of information and provid- 
ing casy-to-usc retrieval tools. WorldViews 
is an experimental system built on such a 
strategy: to unite as many aspects of infor- 
mation organization as possible around one 
simple, familiar, yet versatile knowledge 
representation framework. It uses a lattice- 
structured version of the traditional the- 
saurus framework.'* 2 The term "thesau- 
rus" refers here lo a structured set of subject 
headings or descriptors, rather than a liter- 
ary thesaurus, which is a list of words and 
their synonyms. Drawing on ideas and tech- 
niques from information retrieval and Al. 
WorldViews focuses on the unified use of 
a thesaurus to automatically index and re- 



The strategy of the WorldViews system is to unite 
as many aspects of information organization as 
possible around one simple, familiar, yet versatile 
knowledge representation framework: a lattice- 
structured version of the traditional thesaurus. 



trieve information. Another aspect of in- 
formation organization, the task of alerting 
users to relevant documents based on user 
interest profiles, can easily be encompassed 
by this strategy but is not discussed here. 

Thesauri and related conceptual frame- 
works have long been a staple of library 
science and are used in numerous on-line 
databases. 2 However, these databases rely 
on human experts to index documents 
using terms from an associated controlled 
vocabulary or thesaurus. Moreover, few 
scientists use the logical structure of the- 
sauri for retrieval or as a natural frame- 
work for exploring collections. White 
some research has concerned automatic 
thesaurus construction, 3 the idea of using 
on-line thesauri as a fundamental orga- ; 
nizing principle in automatic indexing and . 



information retrieval has received tittle at- 
tention. (The sidebar on pages 50-5 1 dis- 
cusses related work in the Al community.) 

Metaphorically, a thesaurus provides a 
conceptual map of the information space 
corresponding to its areas of application. 
WorldViews realizes this metaphor by in- 
tegrating query-initiated search with sub- 
sequent navigation through a well-defined 
space of relevant subtopics. This space is 
not limited to statically related subtopics, 
but can include the dynamic calculation of 
new subspaces defined by previously unre- 
lated conjunctions of thesaurus entries (de- 
scribed in more detail later). 

WorldViews has been used to process 
electronic news articles as well as abstracts 
of technical reports from Bell Labs and 
other organizations. These documents 
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cover a wide range of topics, ranging from 
relatively academic pieces about neural 
networks, to pragmatic reports about soft- 
ware requirements, to articles on affirma- 
tive action. Due to the proprietary nature of 
Bell Labs reports, the examples in this 
article use only non-Belt abstracts. 

Overview of WorldViews 

World Views consists of a system for 
automatic document indexing, an in- 
formation retrieval system, and a user 
interface. These subsystems are unified 
in the sense that they all rely heavily on 
one on-line thesaurus. WorldViews is 
written in C. and uses X Windows for its 
user interface. 

Automatic indexing is the process by 
which thesaurus entries are automatically 
assigned to documents as content descrip- 
tors. The automatic-indexing subsystem 
currently runs in batch mode as a separate 
process. Its input is a set of documents, D, 
and the thesaurus, which provides it with 
descriptors. The output is an updated ver- 
sion of the thesaurus: Any thesaurus 
entry thai has been used to index a 
document in £ will now include a pointer to, 
or a posting for, that document Using a 
constrained form of spreading activation in 
a semantic network, 4 this subsystem traces 
connections among concepts to draw out a 
document's implicit conceptual content 
based on the explicit concept references it 
contains. For example, a document that 
explicitly uses the concepts "wasps" and 
"mosquitoes" but never uses the term "in- 
sects" will still be indexed under both the j 
more specific and the more general terms. I 

Automatic indexing also estimates the ' 
percentage of a document's content that is t 
relevant to each of the assigned descrip- j 
tors. This is called the content number for 1 
that document with respect to a thesaurus 
element. Thus, in the previous example, 
the document's content number relative to 
the term "insects" will he computed by 
combining the content numbers associated 
with the narrower terms "wasps'* and "mos- 
quitoes." 

When the same term represents more 
than one word sense, the indexing sub- 
system computes distances among thesau- 
rus entries to choose the correct term to 
apply. For example, the word "programs** 
is a constituent of several entries in the 
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Figure 1. WorldViews display offer the user types "lelecommiaUatlbfis. 



thesaurus, including "computer programs." 
"social programs," and so on. Before using 
this term to index a document, the system 
will try to determine the correct sense of 
the term. 

The indexing subsystem is integrated 
with portions of a Bell Labs information 
retrieval system called Slimmer. 3 Essen- 
tially, WorldViews uses Slimmer to create 
inverted files of terms that occur in docu- 
ment collections but are not contained in 
the thesaurus. This allows WorldViews to 
respond to user queries that it cannot fully 
interpret relative to its thesaurus. This sim- 
ple expedient guarantees that, in the worst 
case. WorldViews* levels of recall and 
precision will be on a par with those of a i 
traditional keyword system. Since most : 
real- world document collections are con- j 
tinually expanding, while thesauri are up- j 
dated only periodica tly. some approach of j 
this sort is necessary to make WorldViews ' 
reliable in practical applications. 

The information retrieval subsystem uses 
the thesaurus in a number of ways. First, it 
tries to interpret user queries as conjunc- 
tions of thesaurus entries. It also uses the 
postings associated with thesaurus entries 
to retrieve corresponding documents. When 
processing an interpreted query, the sys- 
tem uses the thesaurus to find related 
subtopics that will be incorporated as part 
of its response. As later examples will 



show, this capability helps brings the infor- 
mation space metaphor to life. 

In the example in Figure I, the user has 
typed in the query "telecommunications," 
which is echoed in the User Queries win- 
dow. The system has responded in two 
ways. In the Retrieved Titles window, the 
system has listed the titles of documents 
that it has judged to be relevant (in de- 
scending order of the percentage of the 
document considered relevant). It has also 
printed a message in the Messages window 
saying how many titles are in the scrollable 
list, and the range of the percentage of 
relevant content At the same time, in the 
window labeled Subtopics: Level I. the 
system has displayed a scrollable list of 
more specific subtopics related to the us- 
er's query. Each item in this list applies to 
one or more documents in the collection, 
so these are all possible options. 

The user can now choose to view a 
document by mouse-clicking on the title, 
or click on an item in the subtopic window. 
In our example, the user clicks on the 
subtopic item "transmission," which caus- 
es a new subtopic level (Level 2) to be 
displayed. The user now selects "signal 
interference" in this window, and then se- 
lects the title "Study of UHF Television 
Receiver Interference Immunities," which 
leads to the display shown in Figure 2. The 
messages in the WorldViews Messages 
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window reflect the results of the user's two 
subtopic selections. (1*11 use this document 
as an example throughout the article.) 
, The user can back up and select a differ- i 
ent subtopic by clicking on any item in any ] 
subtopic window, in Figure 2, if the user 
were to select the Level I item "telecom- 
munication equipment," the Level 3 subtop- 
ic window would disappear, and the Level 
2 subtopic window would he repoputated 
with new subtopics. The row of subtopic ' 
windows shifts to the left (off the screen) if ' 
more space is needed to accommodate deep- ; 
cr levels of subtopics. 



Lattice-structured thesauri 

When I began work on the World Views j 
project I needed to obtain or build a lattice- ; 
structured thesaurus. Since the original ; 
documents were to be messages drawn ! 
from electronic bulletin boards, profes- 
sionally constructed academic thesauri 
would likely be inadequate. For example. 



word sense information would be vital to 
disambiguation, but professionally con- 
structed thesauri do not contain this in- 
formation in machine-usable format, if at 
all. Thus the WorldLattice was created — 
my attempt to build a thesaurus for every- 
day life, so to speak. It currently consists 
of roughly 3,000 nodes (preferred terms), 
with an additional 500 or so equivalent 
terms or aliases. It also contains 391 lex- 
ical entries that are associated with two or 
more senses (nodes). j 
As the project** focus shifted to Bell I 
Labs technical reports, the WorldLattice [ 
grew to resemble a technical thesaurus, j 
Eventually, therefore. I obtained IEEE's \ 
In spec thesaurus. 6 The latest edition of ! 
In spec deals with the fields of physics, j 
electronics, and computing. According to an : 
Inspec representative, from 1 973 through 1 99 1 
the thesaurus grew from 4.000 (o 6,000 
preferred terms (used in indexing) and from . 
3.000 to 7.000 lead-in terms (cross-refer- 
ences). This supports the view that thesauri 
are never complete, so that the supplemental 



use of an inverted file in a system like 
World Views will always be needed. 

I converted the Inspec ou-line thesaurus 
into World Views format and added some 
entries to let the system apply its word 
sense disambiguation technique. The mod- 
ified thesaurus contains 620 lexical entries 
with two or more associated senses. (From 
here on, "Inspec" refers exclusively to the 
modified version.) All the World Views 
techniques have been successfully applied 
to the Bell Labs document collection using 
the Inspec thesaurus. 

While this article focuses mainly on re- 
sults obtained using the WorldLattice, 
WorldViews can be used with any lattice- 
structured thesaurus. (I will also present 
some results obtained using Inspec.) 

Properties of lattice-structured thesauri 

A lattice^siructured thesaurus is a lattice of 
word or phrase senses or concepts. Mathe- 
matically speaking, a lattice is a partially 
ordered sei in which every pair of elements 
has both a greatest lower bound and a least 
upper bound. Greatest lower bounds are of 
no interest in this context, but the idea of 
looking at the least upper bounds of two 
concepts in a lattice- structured thesaurus 
turns out to be useful, as wc shall see later. 
(In contrast to strict mathematical usage, 
n on unique least upper bounds are allowed.) 

Following thesaurus terminology, the 
nodes of a lattice-structured thesaurus are 
related via the broader term (BT) relation 
and its inverse, the narrower term (NT) 
relation. 2 Thus, for example, the concept 
of physics is a narrower term (more specif- 
ic) relative to the sciences, while the latter 
is a broader term (more general) relative to 
the former. In thb article, these relations 
arc closed under the transitivity property. 
For example, if Z is narrower than Y. and 
Y is narrower than X. then Z is narrower 
than X. Viewed in this way. these relations 
are basically equivalent to the more-gentr- , 
al-ttum and more- specific-than relations, j 
more commonly used in A I work on sub- j 
sumption. | 

In many thesauri, entries can be associ- j 
aled via the related term (RT) relation, 
which is distinct from BT/NT. While the ! 
BT/NT relationship is reserved for genus/ * 
species relationships, the RT relationship , 
represents other forms of relatcdncss. 2 For 
example, we should use the RT relation to 
connect the terms "Vietnam War" and "Viet- 
nam War veterans."* However, it also makes 
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sense to regard ihe latter term as an NT 1 
with respect to the former, thus avoiding \ 
the use of RT. j 

Unlike the BT/NT relations. RT is sym- 
metric that is, if A bears the RT relation to ■ 
B. then B bears it to A. This means that the = 
RT relation docs not induce a partial order- 
ing over the elements in its domain and, 
therefore, does not generate a lattice. For | 
this reason, and because many RT tnstanc- ' 
cs could be reformulated as BT/NT rela- j 
Hons, earlier versions of WorldViews did i 
not use RT at all (the current version does 
allow RT). j 
Thesauri generally offer some way to j 
encode the fact that different character I 
strings can refer to the same topic; for 
example, "AT&T/* "American Telephone 
and Telegraph," and ~ American Telephone j 
& Telegraph" all refer to the same entity. 
Thus, each WorldViews thesaurus entry | 
has an official label and, possibly, aliases. ; 
(Aliases play the role of equivalent terms ! 
in thesauri.) When the indexing system ] 
starts, it compiles the labels and their alias- 
es into finite state machines for text pro- 
cessing. WorldLattice node names, their 
aliases, and their constituent words are 
called texicat entries. 

Figure 3 shows an example of World- 
Views' original, simple mode of thesaurus 
definition, in this case a file that World- 
Views read to create the WorldLattice (the 
current mode uses a database of records 
with field specifiers). Each fine describes 
a set of links in the WorldLattice. The first 
node name in a line is the more general 
node, and the rest are more specific nodes. 
Some node names contain one or more 
bracketed numerals that differentiate which 
sense is meant: transmission 1 1 J" refers to 
telecommunications transmissions, while 
~transmission[2r refers to vehicle trans- 
missions. When WorldViews reads in the 
WorldLattice. it automatically generates 
lexical entries for the node names in this 
file. Any word with bracketed numerals 
will have the requisite number of senses 
associated with it. 

In Figure 3. the first occurrence of the 
word "transmission" has a "S" after it. 
which specifies a default sense. As the 
name implies, the system assigns the de- 
fault sense to a word if the word sense 
disambiguation procedure fails to pick a 
sense. Default senses are a powerful de- 
vice, but must be used with caution. A 
default sense might need to be removed or 



changed depending on the documents be- 
ing analyzed. 

When WorldViews reads in the World - 
Lattice file, it creates a data structure for 
each node specified. Each data structure 
contains pointers to its parents (nodes whose 
topics are more general) and iu children 
(nodes whose topics are more specific). 
The system also computes path informa- 
tion concerning the node's location in the 
I WorldLattice, and stores this information 
I in the data structure for the node. This 
' allows WorldViews to perform fast World- 
| Lattice searches and traversal and quickly 
j determine subsumption relations and least 
I upper bounds among nodes. Basically, the 
logical relationshipbclween any two nodes 
can be determined without examining any 
j other node in the WorldLattice. 

! Automatic indexing 

| WorldViews' automatic-indexing pro- 
' cedure has two phases: Generating a list of 
I WorldLattice concepts explicitly referenced 
by the document, and generating a map of 
the document that also includes implicitly 
| referenced concepts. 



Phase one. WorldViews scans each docu- 
ment once, determines explicit concept ref- 
erences, and tries to resolve the intended 
senses of polysemous words (words with 
more than one meaning). A word or phrase 
Win a piece of text explicitly references a 
concept (node) C in the WorldLattice if W 
is a syntactic variant of Cs label or an alias 
for C, and if the author* s intended use of IV 
is identical to the intended sense of C in the 
WorldLattice. So, for example, the sen- 
tence "The cat is on the mat," contains an 
explicit reference to the WorldLattice node 
cats, but it does not contain an explicit 
reference to the more general node mam- 
mals. The text does, however, contain an 
implicit reference to this more general node. 
If a piece of text W contains an explicit 
reference to a WorldLattice concept t\ it 
also contains implicit references to uny 
node that is more general than C. 

When scanning a document. WorldViews 
looks at each word in order and does a table 
lookup to see whether there is a WorldLat- 
tice label or alias that the word matches or 
for which it could be the beginning. For 
example, the word "American** could be a 
direct match into a lexical entry, or it could 



be the beginning of a phrase thai matches 
an entry such as "American Telephone and 
Telegraph" or " American Express." When 
the system does a tuble lookup on M Amer- 
ican.** it finds that a representation of a 
finite slate recognizer is associated with 
this word. The system will use this recog- 
nizer to continue scanning until either some 
phrase is matched (the system looks for the 
longest possible matches), or no further 
match is found. Thus, if the system were 
scanning the phrase "American's in for- 
eign ..." the system would match the word 
"American's" to "American.'* and the fi- 
nite state machine would terminate once 
the word "in" is scanned, since no World- 
Laitice node is labeled with a phrase begin- 
ning "American in." As mentioned in the 
earlier overview, any word that fails to be 
matched and is not on the stop list will be 
inserted in an inverted index. 

Thus, the lexical and syntactic analysis 
of the WorldViews text processor is fairly 
: simple. Each call to the scanning proce- 
I dure returns a structure containing pointers 
to the next lexical entry recognized in the 
text together with a pointer to the associat- 
j ed WorldLattice entry if the entry is not 
- polysemous. If the entry is polysemous. 
| the structure contains pointers to all the 
possible WorldLattice senses. 

Using locality for word sense disambig- 
uation. WorldViews uses the information 
gathered during the first phase to accumu- 
late evidence concerning the intended sense 
of polysemous words. Suppose WorldViews 
is processing a document and finds a word 
or phrase that it knows has multiple mean- 
ings. For example, "interference" in the 
document in Figure 2 has two senses in the 
WorldLattice: One is the sense used in 
"electromagnetic interference." and the 
other is used in "there was interference on 
the play" with respect to football. Howev- 
er, the system must really choose among 
six possibilities to resolve the polysemy. 
The four other possibilities arise because 
; 'interference** is a constituent in other 
j WorldLattice nodes: signal interference. 

I wave interference, and electromagnetic in- 
terference use the first sense of the word, 
and pass interference uses the second. 
| WorldLattice nodes whose labels share a 

I common constituent sense are called sub- 
senses of that sense. 
In the current implementation, the sys- 
j tern can resolve an ambiguity only in terras 
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Mated work 

Since the body of related work in auto- 
matic indexing, word sense disambigua- 
tion, and information retrieval is so large, 
this discussion focuses mainly on recent 
research in the AI community. 

Natural language processing. Much of 
the AI work in all three areas relies to 
wme extent on the use of natural language 
processing techniques, in particular, pars- 
ing algorithms. The Scisor' and Ferret 2 
systems represent this school of research. 
Using frame-based knowledge representa- 
tion techniques, they try to generate rich 
semantic representations of text. This 
makes it easier to develop natural-lan- 
guage question-answer user interfaces, 
which the authors claim yield higher rates 
of precision and recall. However, such 
systems typically deal with specialized 
domains. Scisor, for example, analyzes fi- 
nancial news stories concerning mergers 
and acquisitions; the system has been used 
in other domains, but still rather narrow 
ones. 1 

Aside from simple finite-state recogniz- 
ers, WoridViews uses no parsing or related 
natural language processing techniques. In 
principle, such techniques could be inte- 
grated into WoridViews, either as a front 
end to the information retrieval system, or 
into the automatic-indexing system. How- 
ever, since WoridViews* scope is essen- 
tially unrestricted, the effort involved in 
gathering and maintaining the necessary 
domain knowledge to make natural lan- 
guage techniques worthwhile would be 
prohibitive. 



Spreading activation in semantic nets. 
Another major theme in AI work in these 
areas is the use of spreading activation in 
semantic networks. The Grant system, for 
example, uses this technique to match de- 
scriptions of relevant funding agencies to 
descriptions of grant proposals. 3 This is 
similar to WoridViews* use of its thesau- 
rus to derive a document's implicit con- 
ceptual content from what the system has 
determined. to be its explicit conceptual 
content. Grant does not do uutomatic in- 
dexing or word sense disambiguation; 

A key difference between Grant and 
WoridViews is in the nature of the con- 
straints applied to the spreading activation 
technique. Grant uses distance and fan-out 
restrictions as well as rule-like constraints, 
called path endorsements, to limit the 
nodes that can be activated. WoridViews* 
constraint on spreading activation is based 
entirely on the logical structure of the the- 
saurus and the nature of the task at hand. 
Thus, during automatic indexing, when the 
system uses spreading activation to estab- 
lish a document's implicit conceptual con- 
tent given its explicit content, the activa- 
tion can only go upward from more 
specific to more general terms. 

Hirst and Charniak's use of marker 
passing in word sense disambiguation 4 is 
related to WoridViews* use of locality 
among thesaurus entries to determine word 
or phrase senses. A key difference is that 
the WoridViews technique relies exclu- 
sively on the distances between entries and 
their interconnectcdness in the thesaurus. 
In the case of Hirst and Charniak. each 



node in the semantic net is a frame con- 
taining linguistic and conceptual knowl- 
edge to be used in the disambiguation pro- 
cess. Also, WoridViews bases its decisions 
about word sense on a wider context, es- 
sentially paragraph -sized, while Hirst and 
Charniak use a sentence- sized context. 

. Belew's connection ist approach to informa- 
tion retrieval 3 is another example of work that 
is related to the WoridViews approach. 

Machine- readable dictionaries in 
word sense disambiguation. A number of 
approaches to word sense disambiguation 
use machine-readable dictionaries. 6 Slator 
advocates processing dictionary entries to 
extract a semantic network, which in turn 
is used in the disambiguation task as pan 
of the overall natural language processing 
task. 7 Other researchers have developed 
approaches that work directly from a dic- 
tionary's contents. 1 * 9 Lesk, for example, 
counts the number of words shared by dic- 
tionary definitions of the competing senses 
of pojyscroous words.* 1 This overlap is tak- 
en as evidence in favor of those senses. 

On the surface, the approaches of Lesk 
and of Wilks and colleagues 9 seem very 
different from WoridViews*. Nevertheless, 
the intuitions underlying these approaches 
appear to be similar. According to Wilks 
and colleagues, word senses that occur to- 
gether locally in a document are likely to 
be semantically related. For both sets of re- 
searchers, the degree of semantic relation- 
ship is measured by word co-occurrence 
statistics relative to dictionary definitions; 
WoridViews measures the degree of 



If WoridViews cannot resolve an ambi- 
guity in this way, it waits until the docu- 
ment has been completely scanned. While 
scanning, it keeps track of all unresolved 
ambiguities, gathering evidence that might 
contribute to their eventual resolution. 
WoridViews* policy of deferring decisions 
is meant to be used only with short 
documents such as abstracts or electronic 
news messages. Full-length documents such 
as papers and manuals are broken into 
groups of paragraphs totalling at least 150 
words, and then indexed separately. t 

The evidence that WoridViews accumu- j 
I ales concerning an ambiguous word or 
phrase W is the WorldLutticc distances I 
between the nodes for the various senses/ ; 
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of an actual WorldLattice node. The two 
general senses of "interference** do not 
occur as independent nodes, but only as 
constituents of the four nodes just men- 
tioned (that is, as subsenses). Thus, only 
four choices exist after all. However, there 
are situations in which the evidence sup- 
ports selection of a sense, but not of a 
subsense. For example, it is often easier to 
determine that a text is using "interfer- 
ence" in the sense of "'signal interference,** 
''wave interference,** and "electromagnet- 
ic interference,** than it is to determine 
which of these specific subsenses is 
intended. Indeed, some other subsense, not 
included in the WorldLattice at all, might 
be intended. Allowing the system to choose 



| a general sense in such situations will lead 
! to improved performance. This possibility 
will be incorporated in future versions of 
the system. 

WoridViews uses two techniques to re- 
solve polysemy. The first is quite simple: If 
the system sees a polysemous word in a 
known honpolysemous phrase, then it re- 
solves the ambiguity immediately (for that 
occurrence of the word). For example, 
**UHF* is an alias for the WorldLattice 
node ultra-high frequency, and "VHF*" is 
an alias for the node very high frequency. 
The word "frequency** is ambiguous, but 
I the phrases in question are not. so the 
j system can resolve these potential ambigu- 
ities immediately. 
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semantic relationship by distances be- 
tween senses relative to the thesaurus. 

In contrast to these knowledge-based 
approaches, recent work by Church and 
colleagues uses purely empirical statistical 
analysts of text to construct (and train) 
word sense discriminators. 10 

Rule-based and expert system ap- 
proaches. The Rubric system uses a rule- 
based approach to information retrieval. 11 
A collection of multilevel weighted rules 
is used to define a topic or concept. Rules 
at the lowest level define pauems of 
strings that are used to search text: higher 
rule levels relate these patterns to interme- 
diate concepts or abstractions needed to 
define the topic. Rule weights are used to 
compute relevance rankings. Rubric is not 
used for automatic indexing, nor does it 
explicitly deal with word sense disambig- 
uation. 

Rubric's rules are intended to capture 
experiential or operational knowledge of 
the sort that is contained in a successful, 
complex Boolean query or search strategy. 
Knowledge at this level tends to consist of 
varied bits and pieces, rolled together into 
a complex structure. For example, the 
World Series concept 11 is defined in terms 
of both world knowledge rules (for exam- 
ple. Saint Louis Cardinals is a baseball 
team) as well as linguistic rules (such as 
"St. * can abbreviate "Saint"), Thus, even 
though the Rubric rule set grows larger 
and larger, it really never contains a pure- 
ly declarative knowledge representation 
corresponding to WoridVicws" thesaurus. 



Because it uses this kind of knowledge 
representation, WorldViews can support 
features such as subtopic breakdowns, 
conjunctive-expression expansion, and 
the automatic integration of the results of 
such expansions into its thesaurus struc- 
ture. 

Croft and Thompson have looked at us- 
ing expert systems as intelligent assistants 
or intermediaries for information retrieval 
systems. 12 Gauch and Smith also consider 
the role of a thesaurus in automatic query 
reformulation by an expert system inter- 
mediary. 13 
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subsenses of Wand the nodes for the sense/ 
subsense of unambiguous words (or senses/ 
subsenses of other ambiguous words) in 
the document. These distances are translated 
into scores, with shorter distances yielding 
higher scores. After all the evidence is in, 
the sense/subsense with the highest score 
is selected if 

(1) the score surpasses a certain threshold 
(currently set at 3), and 

(2) the difference between this score and 
the score for the next highest compet- 
ing sense/subsense is greater than the 
latter* s score. 

Thus, if the highest score is 6 and the next 
highest score is 4, no decision will be made 



. unless the word has a default sense/subsense, 
; in which case the default will be chosen, 
i The intuition behind this approach is 
that the WorldLaliice nodes representing 
the word senses in a document are often 
huddled together rather than spread apart. 
Consider the sentence "The program has 
courses that are interdisciplinary in na- 
ture." Intuitively, the sense of the word 
"program" corresponding to "computer pro- 
gram 0 has nothing to do with the sense of 
the word "courses" corresponding to 
"meal." This is reflected in the large 
distance separating the nodes correspond- 
ing to these senses in the WorldLatticc. 
On the other hand, the sense of the word 
"program** that corresponds to "academic 



program** and the sense of the word "courses" 
that corresponds lo "curriculum" do have 
something to do with one another. In the 
Worldl ^ttice. these concepts are siblings 
under the concept curriculum. 

So, we can interpret ambiguous words 
by looking for word senses whose nodes 
minimize the sum of the distances between 
the selected nodes (including the nodes for 
the document's unambiguous words). 
However,, the actual technique used is sim- 
pler and more efficient. It is based on the 
notion of a least upper bound of two World- 
Lattice concepts. Let c, C|, c 2 be three 
WorldLatticc nodes. There is a downward 
path from c*| to c 2 if c t = c 2 . or if cj can be 
reached from c, by following a sequence of 
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Table I. DtsoRhMgiotiea of tU word fnterfertnct.' The thm «»ptting subtenses for iMs 
obstnid ore svbeost 1, shjsd fef erfertiKt (comimmkctww); subsMSt % wove Interfer eice 
(fhyskf); and sabseose 3, pass bterftftnce (footbal). 



Cowaied wm 


MMRHPUVl * 


1 CAtr titan Mmn( 


miX MSTAJKf 


UUMGf IN KOtt 


UHF 


No 


Transmission 


2 


♦3 to subsense 1 






Wave phenomena 


2 


+3 to subsense 2 


Television receivers 


No 


Telecommunications 


2 


♦3 to subsense 1 


Television channels 


No 


Telecommunications 


3 


+3 to subsense 1 


Receivers 


Yes 


Football 


2 


+1 to subsense 3 






Telecommunications 


2 


♦1 to subsense 1 


Transmitters 


No 


Transmission 


2 


+3 to subsense 1 


Channels 


Yes 


Transmission 


1 


♦2 to subsense 1 


Signal 


Yes 


Transmission 


1 


+2 to subsense 1 


VHF 


No 


Transmission 


2 


+3 to subsense 1 






Wave phenomena 


2 


+3 to subsense 2 



Total score for subsense 1 = 18 Total score for subsense 2*6 Total score for subsense 3 = 1 



more-speciftc-than links. Now, node c is a 
least upper bound of c x and c 7 i f and only if 

( J ) there is a down ward pathp, from c to c,, 

(2) there is a downward path p 2 from c to 
cj, and 

(3) any other node c' that satisfies ( I ) and 
(2) is not in the paths p x or p 2 . 

The paths p t and pj are called legs; the 
number of links traversed along a leg is the 
distance associated with that leg. 

Table I shows the scoring process for 
"interference"* in ourexample. The ambig- 
uous word is compared once against every 
word in the document. In particular, com- 
paring "interference" with "UHF* involves 
computing the distance from each of the 
three nodes corresponding to the three sub- 
senses of "interference** to the single node 
UHF. As a measure of the distance, we take 
the maximum one-leg distance to the near- 
est least upper bound. The largest distance 
for which an increment will be generated is 3. 
Scores for comparisons against known un- 
ambiguous words are higher than for com- 
parisons against ambiguous words. A dis- 
tance of 1 when compared with an 
unambiguous word yields an increment of +5. 
whereas the same distance with respect to 
an ambiguous word yields an increment of +; 
a distance of 2 yields increments of +3 and 
+ 1. respectively, and a distance of 3 yields I 
increments of +1 and +1, respectively. A • 
distance of 0 can occur if an unambiguous 
phrase using an ambiguous word ( for exam- 
ple, -television receivers") appears with an 
arabi guou s occurrence of that word (such as ! 
'receivers**) in the document. As shown in | 
Table I, the distance from the node signal \ 
interference to the nearest least upper bound | 
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with UHF , via transmission, has a distance 
of 2. Since the term "UHF* is known to be 
unambiguous, this comparison yields an 
increment of +3 to the score for tbe corre- 
sponding subsense of "interference" (sub- 
sense 1). The second subsense. "wave in- 
terference." also has a least upper bound 
with UHF. namely wave phenomena, with 
a maximum one-leg distance of 2. None 
I of the scores for the other subsenses of 
I the word are incremented as a result of 
} this comparison because the distances are 
! too large. In this case, an increment of 
\ +10 is awarded to the sense/subsense in 
question. 

I<et's look at one final detail of this 
procedure. The results of comparisons with 
more specific subsenses of a word or phrase 
are considered only when they yield dis- 
tinct least upper bounds from those derived 
from comparisons with any more general 
sense/subsense. Thus, in our example, the 
term "electromagnetic interference** is a 
child of **wave interference.* 1 but every 
comparison with cither term yields exactly 
the same least upper bound. Intuitively, 
this means there is no independent evi- 
dence supporting the more specific sub- 
sense as opposed to the more general one. 

The technique I have described involves 
choosing values for a number of parame- 
ters: for example, the size of the scores, the 
maximum distances allowed. 3nd so on. 
The absolute values assigned to Ihe pa- . 
rameters appear far less important than • 
iheir,relative sizes. For example, smaller ' 
distances between senses or subsenses I 
should yield higher scores, and compari- 
sons with senses or subsenses known to 
occur in the text should yield higher scores | 



than comparisons with polysemous senses 
or subsenses. 

Experimental results. I evaluated this 
word sense disambiguation technique us- 
ing about 12.000 Bell Labs abstracts. I 
gathered two types of statistics: general 
performance, and performance on specific 
examples of ambiguous words. I also col- 
lected experimental evidence concerning 
the change in performance caused by chang- 
es, to thesauri that were made to correct 
specific performance problems. The sys- 
tem detected 39,138 polysemous word or 
phrase occurrences, and picked a word 
sense in 32,157 of these cases, or about 82 
percent of the rime. (This number drops to 
72 percent if default senses are not used). 
Only 1,914 of these cases were decided 
immediately on the basis of a polysemous 
word being part of a larger single-sense 
phrase; the rest required the use of World- 
Lattice locality, as described earlier. 

How often did the system make a good 
choice? Obviously, it was not practical to 
check all 32,157 cases. Instead, I selected 
and checked a random sample of 200 cases. 
The system made good word sense choices 
77 percent of the time over this sample, 
which can be extrapolated to the entire set 
with an error range of ± 5 percent. This 
general level of performance is compara- 
ble to those reported elsewhere. 7 *" 

However, the current WorldLattice is 
incomplete relative to the set of documents 
indexed. We would really like to know the 
level of performance that can be expected 
by a version of WorldViews that uses a 
relatively complete Worldl^ttice with re- 
spect to the documents indexed. While I 
cannot answer this question at this time, the 
following experiment shows that the sys- 
tem's performance tends to improve as the 
WorldLattice is built up. Afler analyzing 50 
electronic news messages, the system de- 
tected .153 polysemous word occurrences 
and resolved 102. 1 checked a random sam- 
ple of 20 of these decisions: 13 were good 
choices, and seven were poor. So far. the 
experiment has the same form as the one 
involving the much larger collection of ab- 
stracts. Using this smaller sample, how- 
ever, we can experimentally address the 
issue of convergence. Suppose we try to 
correct the seven poor decisions by expand- 
ing or altering the WorldLattice in appro- 
priate ways; what effect would these changes 
have on the system's performance on other 

(HE EXPERT 



I 




Signal 
interference (2) 



Television Receivers (4) 
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cases? First of all. the seven problem cases 
were easily corrected by expanding the 
WorldLattice contents. I then ran the 
system over the same 50 messages using 
the updated WorldLattice, and compared 
the differences in log files from the iwo 
runs: Not only were the seven original bad 
decisions corrected, but 14 additional 
bad decisions were corrected as well. 
In no case did the system go from making 
a good word sense decision to making a 
bad one. 

Thus, expanding the WorldLattice to 
correct problems will generally not cause 
problems with cases that the system was 
already handling correctly. Therefore, the 
level of performance achieved with the 
current WorldLattice is only a lower hound 
on the level of performance that can be 
achieved using this technique. As the World- 
Lattice comes to represent a more com- 
plete and accurate mapping of the domains 
of the documents it analyzes, the level of 
its word sense disambiguation performance 
should increase correspondingly. 

This expectation has also been borne out 
by experiments with particular ambiguous 
words. For example* the word "current" 
has two senses in the tnspec thesaurus. 
Initially the system performed at about a 
70-percent level of accuracy for occurrenc- 
es of this word in the Bell Labs collection of 
1 2.000 abstracts. Adding a number of new 
entries to Ins pec to fix these problems brought 
the level of correct word sense decisions up 
to ¥0 percent for this word. 



I Phase two. The system uses the explicit | 

I concept references found in the first phase 
to generate a map of the document that also 

| includes the concepts only implicitly refer- 
enced by the texL Since this map will be a 
proper part of the WorldLattice, it is called 

. the implied document sublattice. 

I The implied sublattice is computed by 
beginning with the explicit concept refer- 

| ences and working up, from NT to BT 
nodes. Each node tracks the number of 

! times it has been visited, and updates its | 
frequency with every visitation. (The sys- ' 

j tern uses these values to estimate the rele- 
vant content of retrieved documents in re- 

j sponse to user queries.) Si nee every upward [ 
chain of nodes in the WorldLattice termi- j 
nates at the root node, and there are no 
cycles in this thesaurus, the process must 

I halt. The frequency computed for the root 

* node is catled the total content. 

I Once this upward activation is complete* 
the system indexes the document by tra- 
versing the sublattice downward: Excluding . 
the root node, the system starts by finding 
the set of most general implied nodes in 
, the subtatticc that were visited at least 

two times on the way up. Fach of these . 
: nodes and alt their descendants are used 
I to index the document. As each of these | 
nodes is visited, its percentage of relevant > 
content is computed by dividing its fre- 
quency hy the. total content and multiply- ' 
ing by 100. These numbers, the content i 
! number* for the corresponding nodes, are I 
stored on disk with the document postings 



for use by the information retrieval sub- 
system. 

Figure 4 shows the implied sublattice for 
our example document. The nodes in boxes 
are the document's explicit concept refer- 
ences; the number in parentheses is the con- 
cept's frequency of occurrence in the text. 
Nodes not in boxes arc implied concept refer- 
ences; their frequencies are simply the sum of 
the numbers for the nodes below Ihem. 

Information retrieval 

As The earlier examples showed. World- 
Views* approach to information retrieval 
is based on using the WorldLattice as a 
conceptual map of an information space. 
The key task of World Views* information 
retrieval system is to interpret a user's 
query relative to the WorldLattice, and 
then access and (if needed) compute the 
subspacc relevant to the query. The system 
also facilitates iterative user navigation 
through neighboring subtopics. 

Query interpretation. Queries are in- 
terpreted via a simple inverted file search 
relative to the WorldLattice (or other in- 
dexing thesaurus), together with a sub- 
sumption analysis to produce the list of 
the most general thesaurus entries that 
match the given terms. Suppose, tor exam- 
ple, the query is "elementary particles.** 
The system will initially look for every 
thesaurus entry containing these terms. In 
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the process, any entry that is determined to 
be subsumed by a more general entry found 
earlier will not be included. For example, 
Inspec contains the entries "elementary 
particle interactions** and "elementary par- 
ticle weak interactions.** The latter is nar- 
rower than the former, and so only the 
former entry will be included in die initial 
response to the query. The latter entry is, of 
course, accessible via subtopic expansion. 

Using inverted file search allows a great 
degree of flexibility in query entry; word 
order, for example, has no effect on the 
response to a query. This also lets users 
type partial terms with identical results: 
typing "clem parti" will yield the same 
results as "elementary particles.** 

A more subtle, but equally or more im- 
portant, advantage of using inverted file 
search is its ability to handle ambiguous 
queries. When a user enters an ambiguous 
query, the system simply returns all the 
most general matching entries for every 
possible sense. The returned topics contain 
all the information needed for the user to 
select the entry or entries that agree with 
the intended word senses. (Identical am- 
biguous entries are distinguished by dis- 
playing the names of their more general 
entries along with them.) 

In forming its interpretation of an infor- 
mation request, the system constructs the 
most specific interpretation possible. For i 
example, ifa user types in the query "physics ' 
quantum mechanics, - the system will ig- I 
nore the broader term "physics." Thus the ' 



interpreted query consists solely of "quan- 
tum mechanics,** and the system will return 
the thesaurus entry for that topic. 

Conjunctive expressions. In the exam- 
ple in Figure 5, the user has typed the query 
"marketing telecommunications " World- 
Views interprets this as a conjunctive ex- 
pression, that is, a conjunction of two known 
thesaurus entries, "marketing** and ''tele- 
communications." These terms are logi- 
cally independent: neither subsumes the 
other in the thesaurus, that is, neither term 
is reachable from the other following the 
links in the thesaurus in a single direction. 
If the terms were not independent, only the 
more specific term would be used in pro- 
cessing the query. 

The system handles conjunctive expres- 
sions in two ways, if it has already seen and 
processed the expression, those results are 
still available and are retrieved (I will elab- 
orate on this shortly). For now. assume that 
the query is entirely new to the system. To 
generate a response, the system computes 
the intersection of the document postings 
contained under the terms in the interpreted 
query. The resulting list is displayed in the 
Retrieved Titles window (see Figure 5). 
When computing relevant content percent- 
ages for documents retrieved by conjunc- 
tive expressions, the information retrieval 
system uses the minimum of the content 
numbers associated with the document. In 
this example, the single item retrieved has 
a 7.7-percent content for "marketing" and 



a 34.6-percent content for ''telecommuni- 
cations,** resulting in a 7.7-percent content 
with respect to the conjunctive expression. 

Computing expansions of conjunctive 
expressions. To give the user more specific 
subtopics, the system needs to expand its 
thesaurus around the conjunction "market- 
ing & telecommunications.** Such expan- 
sions are calculated as needed, either in 
response to a query interpreted as a con- 
junction of logically independent World- 
Lattice nodes, or in response to a subtopic 
selection using the mouse. The six new 
conjunctive expressions generated by this 
process are listed in the Subtopics: Level I 
window in Figure 5; they represent all the 
closest logically minimal conjunctive ex- 
pressions subsumed by "marketing & tele- 
communications*' that actually describe 
documents in the collection. The notion of 
closeness of conjunctive expressions is 
defined as follows. Let E y A, and B be 
conjunctive expressions. A is closer to E 
than B if and only if E subsumes A and 
either £ does not subsume B oM subsumes 
B. Using the example in Figure 5, if E 
equals "marketing & telecommunications,'* 
A equals "marketing research & satellite 
communications,** and B equals ''market 
survey & communications satellite,** then 
the terms of the definition are satisfied. 

The user now selects the subtopic "mar- 
keting research & satellite communica- 
tions." This causes the system to iterate the 
preceding steps with respect to this partic- 
ular conjunctive expression; that is, the 
document titles in the relevant intersection 
are computed and displayed, and the sys- 
tem expands its thesaurus around the ex- 
pression to yield two new (more specific) ; 
expressions in the Subtopics: Level 2 win- j 
dow. The user then selects the single dis 
played title to bring up the text. 

The expansion technique is fairly straight 
forward (see the schematic example in Fig- | 
ure 6). Suppose the user has issued a query l 
that the system interprets as a new con- 
junctive expression "A & B" where X and 
B are logically independent thesaurus en- 
tries. For each term to the conjunctive 
expression, the system finds all the near- 
est, more specific thesaurus entries that 
contain document postings. For example, 
in Figure 6 the relevant terms for A are A2 1 , 
A 12, and C, A22 is not included because 
A 1 2 is a closer node to A along the same 
path; A13 contains no document postings. 
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The system then goes through all ihe 
postings associated with these entries. For 
each document, it keeps track of all the 
entries thai post it {"Applicable nearest 
term*** in Figure 6). Only documents I and 2 
are posted by suhierms falling under both A 
and B. This accounts for the new conjunc- 
tive expressions generated in the following 
line in the figure: two conjunctive expres- 
sions have been generated for document 2. 
Finally, the term C which is a narrower 
term of both A and tf. will he included in the 
expansion of 'A & £'* even though it is not 
a conjunctive expression itself. 

As mentioned earlier, once WorldVicws 
has expanded its thesaurus around a new 
conjunctive expression, the results are auto- 
matically integrated into its thesaurus struc- 
ture, so that the system does not need to 
duplicate work. The system can do this 
because the logical relationship between 
any two conjunctive expressions, or be- 
tween a conjunctive expression and a sim- 
ple thesaurus entry, is completely deter- 
mined by the known logical relationships 
among the thesaurus entries used in con- 
junctive expressions. 

While this procedure is straightforward, 
it can be expensive to carry out: the com- 
plex it y of the worst case is governed by the 
product of the largest number of indepen- 
dent sub terms that can occur under each 
term to he expanded. Simply dealing with 
a large number of postings can slow down 
performance. Nevertheless, experience with 
12.0(H) Hell Labs abstracts has been quite 
reasonable. Computing the expansions for 
relatively specific queries, for example, 
"quantum well & quantum tunneling & 
VLSI." leads to nu slowdown in perfor- 
mance. Computing a general expansion 
such as "telecommunications & Asia," takes 
about 5 to 10 seconds. 

While more experience with larger collec- 
tions is needed to better judge the average 
compute limes for expansions, it is unlikely 
that the technique described here can be 
applied uniformly in real time. Depending 
on the circumstances, however, numerous 
satisfactory options are available. For ex- 
ample, the user could be invited to select a 
particular conjunctive expression from a 
menu, rather than wailing for the system to 
compute all the conjunctive expressions 
generated by expansion of some expres- 
sion. Alternatively, the system could carry 
out the computation during times of mini- 
mal usage and notify the user on completion. 




Document number 



Applicable nearest terms A21 , B21 

New conjunctive expressions generated A21 & 821 



A21.A12.B12 
A21&B12 
A12 & 812 



flgvre 6. Schematic example of a conjwictive e itmsiH expansion. 



The results of the experiments with 
WorldVicws compare favorably with oth- 
er techniques. They also indicate that bet- 
ter performance can be expected with a 
more mature thesaurus. 



General versus specific queries. These 
examples show that World Views gives rea- 
sonable, well-structured responses to gen 
eral queries such as ''telecommunications. 
With large collections of Bell Labs docu- 
ments, this and similarly general queries 
cause World Views to access thousands of 
document titles. As one would expect, ad- 
dressing such queries to a keyword system 
such as Slimmer returns a very small frac- 
tion of the applicable documents. The rea- 
son, of course, is that most technical docu- 
ments about a specific subtopic in 
telecommunications never explicitly use 
the word ''telecommunications/* 

What makes the WorldVtews response 
to such queries reasonable, however, is not 
the large number of relevant titles returned, 
but the inherently structured nalurc of the 
response. The user does not need to scruti- 
nize numerous titles or documents; the 
subtopics navigation mechanism allows the 
user to easily descend to more specific 
levels, or back up to follow other paths. 
The list of retrieved titles changes accord- 
ingly: as more and more specific subtopics 
are selected, fewer and fewer titles remain. 
Documents with low relevance to the ini- 
tial general query, but with higher rele- 
vance to the more specific selected subtop- 
ics, will move up the list of remaining 
titles, as can be seen by comparing the 
titles in Figures I and 2. 



What about the opposite problem: Can 
Worid Views generate a reasonable response 
to specific queries? Typically, the respons- 
es to such queries contain few. if any, 
document titles. Nevertheless, the collec- 
tion might include closely related docu- 
ments that would also address the user's 
underlying information needs. For exam- 
ple, returning to Figures I and 2. if the user 
had begun by issuing the relatively specific 
query **cros stalk." then only one document 
title would be retrieved, with no subtopics. 
However, "crosstalk"* is a narrower topic 
falling under the broader topic of "signal 
interference." If "signal interference** or 
any narrower subtopic under it contains 
additional document postings, we should 
make those possibilities apparent to the user 
by including the additional items and subtopic 
windows in the display, and printing an 
appropriate message. (This will be included 
in future versions of the interface.) 

| Hybrid technique. If none of the query 
can be interpreted relative to WorldVicws* 
thesaurus, ihe entire character string is 

(handed to Si immer's keyword search facil- 
. ity. and the results are returned to World- 
Views for display. Of course, only titles, 
text, and messages can be displayed, since 
no thesaurus subtopics can be retrieved. 
Suppose, on the other hand, that part of the 
query is known to WorldVicws and part is 
not. Say. for example, thai "teleeommuni- 
: cations" is known to WorldViews, but 
"marketing" i* not. In this case, the system 
would hand the unknown part to Slimmer, 
which would use its* inverted index and 
• return a list of postings, that is. pointers to 
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documents containing the term. If more j 
than one term is given to Slimmer, (he ; 
intersection of the postings is returned. In : 
the meantime. World Views handles the : 
rest of the query as though the other portion , 
had not been given at all. If Slimmer re- 
turns no postings, then the whole query ' 
fails. Otherwise, as before. WorldVicws 
computes the intersection of the document 
postings it finds with those returned by : 
Slimmer. Thus, in this example. World- . 
Views first computes the intersection of ■ 
the postings for "marketing" with those for 
"telecommunications" and lists the results . 
. in the Titles window. It also computes this 
intersection for each Level I subtopic found 
for "telecommunications/* for example, 
"marketing & satellite communications " 
Thus, in the case of a partially interpreted 
query, WorldVicws can still perform par- 
tial expansions of conjunctive expressions, 
hut the unknown portion of the query re- 
mains the same in all the subtopics gener- 
ated in all levels. 

Performance 

The rate at which the automatic-index- 
ing system processes text depends on the 
properties of the lai I ice-structured Ihesau- * 
rus being used to do the indexing. Clearly, j 
a thesaurus with greater average depth will j 
tend to increase the amount of processing I 
needed to construct the implied document | 
sublattice: average node connectivity is j 
also a factor here. The number of nodes in ! 
the lattice <and the number of aliases) will : 
also affect processing time, since the more . 
nodes in the lattice, the more likely it is that 
concepts will be found in the text. 

I lowever, even in the worst case (where . 
every concept in a document somehow forces 
the algorithm to consider every other node in 
the indexing lattice), the amount of process- 
ing time grows only linearly with the num- 
ber of concepts. This is also true of the word 
sense disambiguation part of the algorithm. 

Experimental results also confirm that 
the indexing algorithm is fast. When using 
the World Lauicc as the indexing thesau- 
rus, the algorithm processes text at the rate 
of about 30 million characters an hour (in 
processing time) on a machine performing 
25 million instructions per second. I de- 
rived this figure by processing roughly 
12.000 Be 1 1 Labs abstracts containing about 
15 million characters of text in about 30 



minutes. In comparison, just creating an 
inverted index for the same collection takes 
about 15 minutes of processing on the 
same machine. Results using In spec as the 
indexing thesaurus arc slightly faster, rough- 
ly. 45 million characters per hour. Even 
though Ins pec contains more than twice as 
many nodes as the WorldLattice. it is on 
average a shallower lattice, and this ac- ! 
counts for the difference in processing time. ! 
The average distance of a node to the World- • 
Lattice root is 8.230, while the average ; 
distance of a node to the root of the Inspec 
lattice is 2.659. 

During indexing, the system keeps a 
representation of the entire thesaurus in 
memory. The space required is about 4 
Mbytes for the WorldLattice and about 8 
Mbytes for Inspec. We foresee no prob- 
lems in ultimately accommodating thesau- 
ri of up to 20,000 entries, while maintain- 
ing a comparable rate of throughput. Since 
query interpretation proceeds via an in- 
verted file approach, entire thesauri do not 
need to be memory-resident during retrieval, 
but are brought into memory as needed. 
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ISTORICALLY. WORK IN IN- 
formation retrieval has emphasized two 
measures of retrieval effectiveness: recall 
ond precision. While the system outper- ; 
forms traditional keyword-based retrieval 
systems, comparisons with state -of- the -art 
information retrieval approaches have yet 
to be undertaken. Plans are underway to 
compare it to the Smart system, which uses 
the vector space approach.' or the Inquery 
system, which is based on the inference net 
approach. 10 

With the increasing emphasis on the role 
of moving and managing information in 
today's world, information retrieval should ! 
focus not only on techniques for improving 
precision and recall, but also on techniques 
and systems that educate. In other words, 
information retrieval should provide not 
only better access to documents for users 
who have a good idea of what they are 
looking for. but also useful and knowl- 
edge-imparting responses to users who need 
to learn about some topic or area with , 
which they have little or no familiarity. J 
World Views is intended to further that ! 
mission. 



Acknowledgments 

1 thank the reviewers for their critical corn- 
menu; on this work. I thank Les Lunas and Bob 
Waldsiein for access to Linus databases and the 
Slimmer system, and their comments on and 
interest in this work. I also thank David Lewis 
for his comments and suggestions. 



References 



1. M. Billiard. H. Landau, and A. Tamir. "The 
Hierarchical Indented Thesaurus System." 
Proc. ASiS Annual Meeting 1971. pp. 373- 
380. 

2. F.W. Lancaster. "Thesaurus Construction 
and Use: A Condensed Course.** tech. re- 
port. Unesco, I98S. 

3. C. Salton. "Experiments in Automatic The- 
saurus Construction for Information Re- 
trieval," Information Processing, VoL 7 1 , 
1972. 

4. P.R. Cohen and R. Kjeldseo. "Information 
Retrieval by Constrained Spreading Acti- 
vation in Semantic Networks." Informa- 
tion Processing and Management, Vol. 23, 
No. 4. 1987. pp. 255-268. 

5. R.K. Waldstein, "Slimmer: A Unix-Sys- 
tem- Based Information Retrieval System." 
Reference Services Rev,, Vol. 1 6, 1 988, pp. 
69-76. 

6. Inspet Thexaurux. I HHP. Service Center. 
Piscatuwuy. N.J.. 1991. 

7. M. Lesk. "Automatic Sense Disambigua- 
tion Using Machine- Readable Dictionar- 
ies: How To Tell u Pine Cone from an Ice 
Cream Cone." Proc. St'xDoc. ACM, New 
York. 1 986; pp. 24-26. 

8. Y. Wilkset al.. "Providing Machine-Trac- 
table Dictionary Tools," Machine Transla- 
tion. Vol. 5. No. 2, June 1990, pp. 99-154. 

9. G. Salton. cd.. Jnc Smart Retrieval System. 
Prentice-Hall. CnglewoodClrffs.NJ.. 197 1 . 

10. H. Turtle and W.B. Croft, "Inference Net- 
works for Document Retrieval," 13th An- 
nual tnt't Conf Research and Develop- 
ment in Information Retrieval, ACM. New 
York, 1990, pp. 1-24. 



Allan Ginsberg is a 

member of technical 
stuff at AT&T Bell Lab- 
oratories. His research 
interests include infor- 
mation retrieval, multi- 
media indexing, and 
machine learning. He 
recieved a PhD in com- 
puter science in 1986 and 
a PhD in philosophy in 
1983. both from Rutgers University. He can be 
feached at AT&T Bell Laboratories. Room 4G- 
6l4.CrawfordsComer Road. Holmdel. NJ 07733: 
e-mail nbg@researcli.alt. con 




56 



lEBfXMKT 



I 



IEEE Xplore Abstract/Citation 



Page 1 of 1 



< Back to Previous Page 



A unified approach to automatic indexing and information retrieval 
- Ginsberg, A 

AT&T; Bell Lab., Holmdel, NJ, USA 

This paper appears in: IEEE Expert [see also IEEE Intelligent Systems] 

On page(s): 46 - 56 
Oct. 1993 
Volume: 8 Issue: 5 
ISSN: 0885-9000 
References Cited: 10 
COD EN: IEEXE7 

INSPEC Accession Number: 4562664 



Abstract: 

WorldViews, an experimental system that unites as many aspects of information organization as possible around a simple, 
familiar, yet versatile knowledge representation framework, is discussed. This framework is a lattice-structured version of 
the traditional thesaurus. WorldViews focuses on the unified use of the thesaurus to automatically index and retrieve 
information. The performance of the WorldViews system is outlined. 
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